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ABSTRACT 

More than 98% of a typical vertebrate genome does 
not code for proteins. Although non-coding regions 
are sprinkled with short (<200 bp) islands of evolu- 
tionarily conserved sequences, the function of most 
of these unannotated conserved islands remains 
unknown. One possibility is that unannotated 
conserved islands could encode non-coding RNAs 
(ncRNAs); alternatively, unannotated conserved 
islands could serve as promoter-distal regulatory 
factor binding sites (RFBSs) like enhancers. Here 
we assess these possibilities by comparing 
unannotated conserved islands in the human and 
mouse genomes to transcribed regions and to 
RFBSs, relying on a detailed case study of one 
human and one mouse cell type. We define trans- 
cribed regions by applying a novel transcript-calling 
algorithm to RNA-Seq data obtained from total 
cellular RNA, and we define RFBSs using ChlP-Seq 
and DNAse-hypersensitivity assays. We find that 
unannotated conserved islands are four times 
more likely to coincide with RFBSs than with 
unannotated ncRNAs. Thousands of conserved 
RFBSs can be categorized as insulators based on 
the presence of CTCF or as enhancers based on 
the presence of p300/CBP and H3K4me1. While 
many unannotated conserved RFBSs are transcrip- 
tionally active to some extent, the transcripts 



produced tend to be unspliced, non-polyadenylated 
and expressed at levels 10 to 100-fold lower than 
annotated coding or ncRNAs. Extending these 
findings across multiple cell types and tissues, 
we propose that most conserved non-coding 
genomic DNA in vertebrate genomes corresponds 
to promoter-distal regulatory elements. 

INTRODUCTION 

In completely sequenced vertebrate genomes, only ~1.5% 
of genomic DNA codes for proteins (1-3). An additional 
3.5% of the genome lacks coding sequences but is none- 
theless conserved across vertebrate phylogeny, strongly 
suggesting its functional importance (1-3). This con- 
served, non-coding 3.5% of the genome clusters into 
> 700 000 unannotated conserved islands, 90% of which 
are <200bp ('Materials and Methods' section). The vast 
majority of these conserved islands have no known 
function. 

Two possible functions for unannotated conserved 
islands are (i) to encode enhancers and other distal regu- 
latory sequences and (ii) to encode non-coding RNAs 
(ncRNAs). Indeed, tens of thousands of vertebrate 
conserved islands have already been found to overlap 
enhancers (4-7), which function at a distance to regulate 
the expression of associated genes. Most of these enhan- 
cers were identified in a genome-wide manner based on the 
presence of the co-activator p300/CBP and of H3K4mel- 
modifled histones (8). Similarly, ~10000 conserved islands 
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have been found to overlap promoters or exonic sequences 
of ncRNAs (9-1 1), whose function may be to act in cis or 
in trans to regulate gene expression (12-15). However, it 
remains unclear how many conserved islands will 
ultimately prove to have enhancer-related, ncRNA- 
related or other functions, since both enhancers and 
ncRNAs remain to be completely identified. Unfor- 
tunately, comprehensive identification of enhancers and 
ncRNAs will ultimately require genome-scale experiments 
to be conducted on all cell types in a vertebrate body, since 
both enhancers and ncRNAs function in subsets of tissues 
and cell types (5,11,16). 

A conceptual and experimental challenge in distinguish- 
ing whether a conserved island is an enhancer or a 
promoter of an ncRNA is that many enhancers produce 
short (<2kb) ncRNAs called enhancer RNAs (eRNAs) 
(7,17-20). Genomic sequence conservation at an 
enhancer has traditionally been thought to reflect the im- 
portance of regulatory factor binding sites (RFBSs), which 
recruit transcription factors initially to the enhancer and 
ultimately, through DNA looping, to associated promoters 
(21-23). However, the synthesis of eRNAs raises the pos- 
sibility that conservation of sequences at enhancers may 
also reflect their importance for promoting eRNA tran- 
scription or for encoding functions of eRNA transcripts. 
Since eRNAs, much like other ncRNAs, could in theory 
act in cis or in trans to regulate gene expression, any given 
enhancer could have important functions both as a trad- 
itional enhancer and as a promoter of a non-coding eRNA. 
Thus, neither the presence of RNA Polymerase II 
(RNAPII) nor evidence of transcriptional initiation at an 
unannotated conserved island is on its own sufficient to 
determine whether its sequence conservation reflects con- 
servation of enhancer function, conservation of ncRNA 
promoter function or both. 

Here we estimate, using genome-wide approaches, how 
many conserved islands function as enhancers (and other 
distal regulatory elements) and how many encode 
ncRNAs. We comprehensively define distal regulatory 
elements using ChlP-Seq and DNAsel-hypersensitivity 
(24) assays based on data from published sources (1,18). 
We also comprehensively define transcribed regions of the 
genome by applying a novel transcript calling algorithm 
to RNA-Seq data obtained from total cellular RNA. 
Applying these approaches to mouse cortical neurons 
and a human (HeLa) cell line, we find that whereas 
hundreds of unannotated conserved islands are trans- 
cribed in these cell types into ncRNAs, tens of thousands 
of unannotated conserved islands can be identified as 
distal regulatory elements. These conserved, promoter- 
distal regulatory elements are distinguishable from con- 
ventional ncRNA promoters based on the low expression 
level, and non-polyadenylated status of the transcripts 
they synthesize, as well as by their lack of the promoter- 
specific H3K4me3 mark (25-27). We find similar ratios of 
conserved ncRNAs to conserved distal regulatory 
elements when expanding our analysis to 10 different 
human cell lines. Our results suggest that the underlying 
reason for the conservation of most unannotated 
conserved bases in vertebrate genomes is their importance 
within promoter-distal regulatory elements. 



MATERIALS AND METHODS 

Our goal was to compare three kinds of genomic loci: 
conserved sequences, RFBSs and transcribed regions. 
To identify unannotated transcribed regions we developed 
a novel algorithm, Haar-wavelet Transcript Calling 
(HaTriC), which is described in the Supplementary 
Methods. Our strategy was to first identify each kind of 
locus and second to relate them to one another. 

Conserved islands 

The conserved islands were obtained from the PhastCons 
scores (as compared with 30 other vertebrates) (2) using a 
coarse-graining procedure that identified bins of at least 
10 bps where the average score was >0.9 (Supplementary 
Figure S3). Summary statistics for the conserved 
islands can be found in Supplementary Table S4 and 
Supplementary Figure S3. The assignment of conserved 
islands to various non-overlapping categories was 
carried out as follows: 

(1) If the conserved island overlapped an exon, then 
we categorized it as either an 'Exon of annotated 
protein-coding gene 1 or 'Exon of annotated ncRNA' 
depending on the coding potential of the gene. 

(2) If there was an annotated Transcription start site 
(TSS) within lkb of either end of the conserved 
island (and the conserved island did not overlap an 
annotated exon), it was classified as a 'Promoter of 
annotated protein-coding gene' or 'Promoter of 
annotated non-coding gene'. 

(3) If there was an enhancer or a RFBS within 100 bp of 
the conserved island, then we categorized it as an 
'Enhancer' or 'Other (unannotated) RFBS' conserved 
island. 

(4) If there was at least 33% overlap (Supplementary 
Figure S10) with a matrix attachment region (MAR), 
then the conserved island was classified as a 'MAR'. 

(5) All the remaining conserved islands were assigned to 
the Tntronic conserved island' conserved or the 
'Extragenic conserved island' category based on the 
overlap with annotated introns. 

Calculating how many conserved islands overlap exons 
of unannotated transcripts is complicated by the fact that 
HaTriC does not provide the exon-intron structure of 
transcribed regions. Hence, we used a statistical 
approach where we assumed that the unannotated tran- 
scribed regions have the same distribution of exon 
numbers, exon lengths and exon conservation as the 
long annotated ncRNAs (Supplementary Table S7). 
Using these assumptions, it is possible to estimate the 
number of conserved islands explained by unannotated 
transcripts in a given cell type (Figure 1 and 3A). 

Regulatory factor binding sites 

RFBSs were identified from publicly available DNAsel 
hypersensitivity and ChlP-Seq datasets. To understand 
how RFBSs are related to transcribed regions, we 
categorized them based on their proximity to promoters, 
enhancers, introns, exons and novel transcribed regions. 
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Figure 1. More conserved islands overlap enhancers and RFBSs than 
unannotated ncRNAs. The bars show the number of unannotated 
conserved islands that overlapped an enhancer, other promoter-distal 
RFBS or novel (HaTriC-defined) transcript (in mouse neurons). For 
Enhancers and RFBSs. a conserved island had to overlap an enhancer 
or RFBS to be counted. For novel ncRNAs, we used a statistical 
approach to estimate the number of conserved islands within exons 
of transcribed regions (Supplementary Methods). Lists of enhancer 
loci were taken from (5,18). RFBSs were defined as TF binding sites 
(mouse neurons) or DHSs (HeLa) that were not at promoters or 
enhancers. 

We classified RFBSs as conserved or non-conserved based 
on overlap with conserved islands. Each RFBS was 
assigned to a category according to the scheme outlined 
below, and the results are reported in Supplementary 
Table S3. We start with the full set of peaks, and once a 
peak has been assigned to a category, it cannot be assigned 
to any further categories. 

(1) If the peak was within lkb of an annotated TSS, it is 
considered 'Promoter of annotated protein-coding 
gene' or 'Promoter of annotated ncRNA' for 
annotated coding genes and ncRNAs, respectively. 

(2) If the peak was within lkb of an enhancer it is clas- 
sified as either 'Intragenic enhancer' or 'Extragenic 
enhancer'. 

(3) If the peak was within lkb of the start of a novel 
transcribed region (identified by HaTrIC, but not 
present in annotation) it is assigned to the 
'Promoter of novel ncRNA' category. 

(4) If the peak overlaps an annotated protein-coding 
gene but is further than lkb from the start, it is 
assigned to the 'Overlaps exon of annotated 
protein-coding gene' or 'Overlaps intron of 
annotated protein-coding gene' depending on its 
overlap with exons. Similarly, peaks overlapping 
annotated ncRNAs are considered either 'Overlaps 
exon of annotated ncRNA' or 'Overlaps intron of 
annotated ncRNA'. 

(5) If the peak does not fit into any of the previous 
categories it is classified as 'Unannotated extragenic'. 

Transcribed regions 

To understand transcription across the genome, we 
combined annotation, de novo transcript-calling and a 
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locus 





Pprppn op 


No. of 


Pprpjpntn ctp 




nf RNA-Sen 














Protein-coding gene 


"71 A Z 1 

/ 1 Ad I 


i o 1 no 
tz lUo 


21.2/ 


Annotated non-coding 


1.381 


2564 


0.92 


gene 








snKJNAs, tKJNAs, scKJNAs, 


26.465 


3625 


0.01 


srpKNAs, tKJNAs 








Promoter AS transcript 


U. d 34 


4544 


ft £1 


Other (HaTric-defined) AS 


0.038 


660 


0.11 


transcript 








Novel (HaTric-defined) 


0.076 


255 


0.08 


transcript 








Extragenic eRNA 


0.013 


622 


0.04 


Intragenic eRNA 


0.008 


331 


0.01 


Other RFBSs-associated 


0.062 


793 


0.04 


RNA 








Associated with other 


0.017 


367 


0.01 


H3K4me3 peaks 








Total 


99.8643 


26169 


23.1147 



The vast majority of RNA-Seq reads in mouse neurons fall within 10 
categories of genomic loci (rows), identified and classified using a com- 
bination of gene annotation, HaTriC transcript calling, and chromatin 
state (Supplementary Methods). Here categories of expressed loci are 
characterized based on their fraction of the total number of RNA-Seq 
reads, their number of genomic loci and their fraction of genomic 
base-pairs. Transcribed loci were required to have nine RNA-Seq 
reads and a read density of at least 1 per kb. Annotated gene categories 
include UTRs and introns. Annotated non-coding genes include those 
annotated in the UCSC, RefSeq, Ensembl, lincRNA and macroRNA 
collections, excluding snRNAs, tRNAs, scRNAs, srpRNAs and 
rRNAs. A 'Promoter AS transcript' is an AS transcript with its 
5'-end within 2kb of an annotated TSS. An 'Other (HaTriC-defined) 
AS transcript' is an AS transcript (overlapping an annotated gene) with 
its 5'-end further than 2 kb from any annotated TSS. An 'Other 
RFBS-associated RNA' starts within 2kb of a RFBS not identified 
as an enhancer. snRNAs, tRNAs, scRNAs, srpRNAs and rRNAs are 
defined by repeatMasker. We note that rRNAs are under-represented 
here relative to within a cell due to their removal from total RNA 
samples by hybridization prior to sequencing. Similar results for 
HeLa cells are presented in Supplementary Table S8. 

targeted search near enhancers and RFBSs (Table 1) to 
produce a set of transcribed regions. To characterize the 
transcriptome, we assigned each read uniquely to a 
transcribed region. To define the transcribed regions in 
the most accurate way possible, as reported in 
Supplementary Table S7, Supplementary Figure S4 and 
Table 1 in the main text, we combined the annotation, 
the HaTriC transcript-caller, and a targeted search close 
to enhancers and RFBSs. Below, we describe how each 
category of transcribed regions was defined, as well as 
the criteria for assigning reads to each category. We con- 
sidered the categories sequentially, and at each step we 
identified (and removed from further analysis) all reads 
that overlapped regions in the current category. 

(1) Annotated protein-coding genes were first separated 
into non-overlapping clusters. From each cluster the 
longest region, g h was extracted as a representative 
of that cluster (this was done to avoid double 
counting, and the majority of clusters contain only 
one gene). If the average read density of g t fell below 
a threshold, the region was ignored and the reads 
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were retained and made available for inclusion in 
another category. 

Next, we applied HaTriC and merged the identified 
regions that had been categorized as corresponding 
to a part of a gene or uniquely to a gene with their 
overlapping genes. When two regions g t and g- are 
merged, they are removed and a new region g t is 
created. The new region contains the union of the 
reads from g t and g' i , and it extends from the 
5'-most end of g t and g\ to the 3'-most end of g t 
and g'j. The transcribed regions gj were frequently 
longer than the annotation would have predicted. 

(2) Having removed all reads corresponding to 
annotated coding genes, we carried out the same pro- 
cedure for annotated non-coding genes. 

When counting the number of reads in the two 
categories relating to annotated genes (as reported 
in Table 1, but not for the transcript read density 
reported in Supplementary Figures S4, S7 and 
Figure 2), we also assigned all sense reads found 
within 10 kb upstream or downstream of g ; - to the 
(protein-coding or ncRNA) genie category. As 
reported by van Bakel et al. (28), these regions 
often have a read density that is above the back- 
ground levels found in more distal regions. 

(3) Next, we searched for promoter AS transcribed regions, 
i.e., divergent transcribed regions (29,30). We started by 
searching all windows located 2kb upstream of all 
annotated TSSs. If a window contained >r 0 reads, it 
was considered significant. Most transcribed regions 
that were not detected by HaTriC are <2kb (see 
Supplementary Figure S4), but to account for longer 
regions we extended the search to the next 2kb 
window upstream of the TSS if the TSS proximal 
window contained >r 0 reads. Additional windows 
were investigated until a window containing <r 0 reads 
was found. For a set of adjacent 2kb windows, the 
length of the transcribed region is defined as the 
maximum distance between all pairs of reads found 
in these windows. We refer to the procedure where sub- 
sequent 2 kb windows are scanned as a 'window-based 
search'. The threshold was set to r 0 = 9 reads in a 2 kb 
window for the mouse neurons and r 0 = 5 reads for 
the HeLa cells, corresponding to an FDR of 0.001. 
The regions detected using the window-based search 
were merged with all unannotated regions proximal 
to known genes identified by the transcript caller. 

(4) This category corresponds to long unannotated tran- 
scripts and hence we assign all regions categorized by 
HaTriC as unannotated and distal to known genes to 
this class. Since there are occasionally low numbers 
of reads close to the starts and ends of the 
unannotated transcribed regions (similar to how 
promoter AS reads are found near annotated 
TSSs), we carried out a window-based search 
upstream and downstream of the transcribed 
regions. Any reads found from the window-based 
search was included in the total read count 
reported in Table 1 and Supplementary Table SI. 

(5) We first applied the window-based search to the AS 
strand downstream of all RFBSs overlapping 



annotated genes. The regions obtained using the 
window-based method are then merged with the 
ones found by HaTriC and categorized as anti-sense 
(AS) with respect to known genes. 

(6) For extragenic enhancers, we applied the 
window-based search in the downstream directions 
on both strands. For intragenic enhancers, only the 
AS downstream window was considered. The regions 
obtained using the window-based method are merged 
with all eRNA regions identified by HaTriC. 

(7) Since the H3K4me3 mark is strongly associated with 
active promoters, we wanted to make sure that we 
did not miss any significant transcription initiated 
from these loci. For all extragenic RFBSs that were 
within 2kb of a H3K4me3 peak, we used the 
window-based method on both strands to extract a 
set of transcribed regions. 

(8) For the remaining extragenic RFBSs that did not 
have a H3K4me3 peak nearby, we again used the 
window-based method on both strands to extract a 
set of transcribed regions. 

(9) Finally, for HeLa cells where we also have access to 
CTCF data, we applied the window-based method 
on both strands at CTCF peaks. 

The Supplementary Methods contain details on how the 
annotations of the mouse and human genomes were 
assembled. There is also a list of all the datasets used in 
this study (Supplementary Table S6). As far as possible, 
given the availability of datasets, we performed parallel 
analyses in mouse neurons and human HeLa cells. 



RESULTS 

Assigning reads to transcribed regions 

Our strategy to understand the function of conserved 
elements required a comprehensive accounting of the tran- 
scriptome, including an ability to comprehensively identify 
non-coding transcripts and distinguish ncRNAs from 
protein-coding genes. To achieve this understanding, we 
developed an algorithm to define transcribed regions of 
the genome de novo (i.e., without relying on annotation) 
using short-read RNA-Seq data. We applied this algo- 
rithm to strand-specific RNA-Seq from mouse cortical 
neurons and HeLa cells. Although other RNA-Seq 
studies have applied de novo transcript detection in both 
yeast (31) and mouse (11,32), with a few exceptions [e.g., 
(28,33,34)], these and other studies of gene expression 
have only detected or defined mature polyadenylated tran- 
scripts. Here we consider polyadenylated transcripts as 
well as non-polyadenylated transcripts, since unannotated 
conserved sequences could give rise to either type of 
transcript. 

To define transcribed regions of the genome de novo, we 
developed a computational approach, Haar-wavelet 
Transcript Calling (HaTriC). HaTriC is an iterative algo- 
rithm that combines evidence from multiple length scales 
to determine if a given region is actively transcribed (see 
Supplementary Methods for full description). Each 
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Figure 2. The transcriptional profile at ucRFBSs is more similar to that of enhancers than promoters. Comparison of several properties of 
transcribed regions (as defined in 'Materials and Methods') that overlap at least one conserved island. TSSs were defined by annotation for annotated 
genes or by HaTriC for unannotated genes. Only expressed loci are included; thresholds for defining expressed loci were nine RNA-Seq reads and a 
read density of at least 1 per kb. (A) Transcribed regions at enhancers and ucRFBSs are short and expressed at lower levels than annotated genes, as 
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iteration of the HaTriC algorithm involves the following 
three steps: (i) At each genomic locus, the change in 
RNA-Seq read density between upstream and down- 
stream regions is computed [as Haar-wavelet coefficients, 
similar to (35)]. (ii) The genomic loci with the biggest 
changes in RNA-Seq density are selected. These loci rep- 
resent a set of candidate boundaries of transcribed 
regions, (hi) Each candidate transcribed region, defined 
as the sequence between two candidate boundaries, is clas- 
sified as either transcribed or not transcribed based on its 



" 1C 



10 



o 
o 



10V 



10' 



• ncRNAs (H3K4me3-defined) 
A RFBS (DHS-defined) 
T Insulator (CTCF-defined) 



AA 



AAA 



▼ 



T ▼ T 



TTTT 
• •• 



#cell types 



,x 10 



-o 7 



^■ncRNAs (H3K4me3-defined) 
□□RFBS (DHS-defined) 
I I Insulator (CTCF-defined) 



S 4 

O) 

c 

a. 3 

CL 

ro 

§ 2 
o 

o 1 




100 200 500 1000 1500 

#cell-types or conditions 

Figure 3. Across many cell types, more conserved islands overlap 
RFBSs than ncRNAs. (A) Number of novel ncRNAs, ucRFBSs, and 
insulators discovered with each additional tissue or cell type 
investigated. The number of conserved islands assigned to each 
category initially increases as a power-law. (B) Extrapolation to add- 
itional cell types of the number of unannotated conserved islands ex- 
plained by ncRNAs. unannotated RFBSs and insulators (based on the 
slopes in the left panel). Assuming that there are a total of 100, 200, 
500, 1000 or 1500 distinct cell-types or conditions, we calculated the 
total number of conserved islands that would overlap ncRNAs, insu- 
lators or other RFBSs. 



average read density. This classification is straightforward, 
since the densities of the candidate regions have a bimodal 
distribution, with the higher density mode corresponding 
to transcribed regions (Supplementary Figure SI). By 
applying the above procedure iteratively, excluding the 
regions that were called as transcribed in previous 
rounds, we are able in each successive round to detect 
transcribed regions with lower read density. As regions 
with high read density are removed, the distribution of 
candidate regions' read densities goes from bimodal to 
unimodal, at which point the iteration terminates 
because no additional transcribed regions are detected. 
The parameters for the algorithm ('Materials and 
Methods' section) are optimized by maximizing the 
fraction of transcribed regions (on one strand of one 
chromosome) that have transcriptional initiation marker 
H3K4me3-modified histones (25-27) bound at their start. 

To identify transcribed regions, we applied HaTriC to 
RNA-Seq reads obtained from sequencing ribosomal 
RNA-depleted total RNA from mouse neurons (~140 
million reads) (18) and HeLa cells (~50 million reads). 
We obtained ~ 10 000 transcribed regions in each cell 
type (Supplementary Table SI). These transcribed 
regions do not necessarily represent specific RNA tran- 
scripts [as do the transcripts defined in some other 
studies, e.g., (11)] but can instead correspond to multiple 
overlapping transcripts synthesized from the same strand. 
For example, the RNA-Seq reads falling within a 
transcribed region that corresponds to a gene typically 
reflect pre-mRNA transcripts (corresponding to reads 
aligning both to introns and exons) as well as mature 
mRNA transcripts (corresponding to exons only). The 
purpose of this approach is to identify regions that are 
transcribed rather than to define precisely the exon- 
intron structure of specific transcripts. 

To evaluate the quality of transcribed regions identified 
by HaTriC, we compared transcribed regions with 
annotated protein-coding genes. Comparing the trans- 
cribed regions with the RefSeq (36), UCSC (37) and 
Ensembl (38) gene annotations, we found, in accordance 
with another recent study (28), that most transcribed 
regions either overlap coding genes on the correct strand 
(76%) or are found within 10 kb of the start of an 
annotated coding gene (18%), despite the fact that these 
two sets of regions together represent just 15% of the 
genome (Supplementary Table SI). The majority of the 
transcribed regions that overlap annotated genes (78%) 
match the gene annotation in an unambiguous manner 
(Supplementary Methods, Supplementary Table SI). 
These results demonstrated sufficient accuracy in calling 
transcribed regions to allow us to begin to categorize 
regions lacking any annotation. 



Figure 2. Continued 

shown by the average expression near TSSs in each category. For sites without obvious strand orientation (e.g. H3K4me3 sites), forward and reverse 
(rev) genomic strands are plotted separately; otherwise, only sense reads are plotted. (B) ucRFBSs and enhancers are associated with fewer 5' ends of 
5'-sequenced ESTs than promoters. (Note the different j-scales.) (C) The ratio of polyA+ to total RNA reads is much lower at enhancers and 
ucRFBSs relative to annotated RNAs. The x-axis is the ratio of normalized polyA+ reads divided by the number of normalized total RNA reads at a 
locus, and the v-axis is the cumulative density (CDF). (D) ucRFBSs and enhancers are expressed at lower levels than annotated RNAs. 
(E) Transcribed regions at ucRFBSs and enhancers are shorter than those at protein-coding genes. (F) ucRFBSs and enhancers are not bound 
by the initiation-specific H3K4me3 mark. (G) Genomic sequence conservation at promoters extends outward further than genomic sequence con- 
servation at ucRFBSs and enhancers. (H) The CpG content at ucRFBSs and enhancers is lower than that at promoters. 
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Few conserved islands correspond to unannotated ncRNAs 

We asked to what extent novel, HaTriC-defined ncRNAs 
coincide with evolutionarily conserved sequences. To 
identify the conserved sequences, we segregated the 
genome into conserved and non-conserved regions using 
a simple coarse-graining procedure performed on 
PhastCons conservation scores derived from alignments 
of 30 vertebrate species (2) ('Materials and Methods' 
section). In each genome (human and mouse), we 
identified 1 million conserved islands, most of them 
<200bp (Supplementary Figure S3). We define -300000 
of these as annotated conserved islands based on their 
overlap with the promoters or exons of annotated 
coding or non-coding genes (RefSeq, UCSC, Ensembl, 
IncRNA and macroRNA annotations combined 
(9,11,36,37,38), Supplementary Table S4). The remaining 
700 000 (presumed non-coding) conserved islands we 
define as unannotated conserved islands (55% extragenic, 
45% intragenic). 

We addressed how many unannotated conserved islands 
might encode promoters or exons of novel extragenic 
ncRNAs. Our strategy was to asssess the overlap of 
unannotated conserved islands with expressed sequences. 
To identify novel ncRNAs, we applied HaTriC and found 
~200 ncRNAs that were already annotated in RefSeq, 
Ensembl, UCSC, the macroRNA or lincRNA datasets 
and ~250 unannotated extragenic ncRNAs (Sup- 
plementary Table SI) (Although 200 annotated ncRNAs 
may seem like a small number to find, this result is con- 
sistent with the reported lower levels of expression and 
greater tissue specificity of ncRNAs (39), especially those 
lacking strong experimental support.). Annotated and 
novel ncRNAs accounted for 2% of transcribed regions 
called by HaTriC and an estimated 3% of the total 
number of RNA-Seq reads but <0.01% of the length of 
the genome (Table 1 and Supplementary Table SI). Given 
the small number of novel ncRNA loci that are actively 
transcribed in either mouse neurons or HeLa cells, rela- 
tively few unannotated conserved islands are estimated to 
function as promoters (<400) or exons (< 1 800) of novel 
extragenic ncRNAs in these cells (Figure 1 and 'Materials 
and Methods' section). These estimates depend on as- 
sumptions about the gene structure of novel ncRNAs 
and the overlap of their exons with conservation islands, 
since HaTriC does not define intron-exon structure 
directly. Our assumptions are based on the gene structure 
of annotated ncRNAs in RefSeq, UCSC and Ensembl. A 
recently published survey of human lincRNAs (39) 
provided an independent set of gene structures that have 
less exonic overlap with conserved islands (Supplementary 
Table S7). Thus, our above estimates may overemphasize 
the extent of conservation explained by novel ncRNAs. 

We also addressed how many unannotated conserved 
islands might encode novel non-coding AS transcripts, 
which have been proposed to regulate sense gene expres- 
sion (40-43). Previously described AS transcripts 
include < 2 kb promoter AS transcripts (34,44,45), 
synthesized upstream of genie promoters; <2kb eRNAs 
from intronic enhancers (18,19); and other, sometimes 
longer AS transcripts (46). It remains unclear how 



common these transcripts are, how highly expressed they 
are relative to annotated genes, and how frequently they 
overlap unannotated conserved islands. We found a sub- 
stantial number of AS transcribed regions, accounting for 
15% of all transcribed regions detected by HaTriC 
(Supplementary Table SI). Most AS transcribed regions 
correspond to promoter AS transcripts (Table 1). Because 
AS transcripts are generally short (Supplementary Figure 
S4c,d), the total fraction of the genome transcribed into 
AS in these cell types is small. Thus, in vertebrate genomes 
as in yeast (44,45), AS transcription is composed predom- 
inantly of short (<2kb), lowly expressed transcripts 
synthesized from promoters. Accordingly, AS transcripts 
do not explain the function of many unannotated 
conserved islands, since AS transcripts originate predom- 
inantly from conserved islands that are already annotated 
as promoter regions. 

Our analysis suggests that few unannotated conserved 
islands encode ncRNAs or serve as their promoters. One 
reason we could be underestimating the overlap of conser- 
vation with ncRNAs is that our ability to detect ncRNAs 
spanning the 45% of conserved islands that are intronic is 
limited to detection of anti-sense ncRNAs. This limitation 
arises because our method does not allow us to distinguish 
specific ncRNAs that overlap pre-mRNAs or mRNAs on 
the same strand. However, the similar numbers of 
H3K4me3 binding sites at extragenic and intronic loci 
(Supplementary Table S3) suggest that this limitation is 
unlikely to result in a dramatic revision to our findings. A 
second potential reason that we could be underestimating 
the extent of overlap between conserved islands and 
ncRNAs would be if there were additional, less highly 
expressed novel ncRNAs not detected by HaTriC. Like 
any algorithm for identifying transcribed regions or tran- 
scripts, HaTriC has the least statistical power to detect 
and define lowly expressed transcripts, especially if the 
transcripts are short. For example, few eRNAs are 
detected by HaTriC (Supplementary Table SI). To 
evaluate the extent of this challenge, we asked how 
many RNA-Seq reads could be explained by different 
sources of annotation, including annotated transcripts 
HaTriC-defined transcripts, and transcripts associated 
with enhancers or other promoter-distal RFBSs (Table 1 
and Supplementary Methods). We found that 99.8% of 
reads can be explained by these combined sources 
of genome annotation. Of the remaining <0.2% of un- 
accounted reads, >99% could be explained by a 
negative binomial background model, suggesting a low 
level of technical noise (Supplementary Figure S2). 

We cannot fully rule out the alternative possibility that 
these reads are derived from tens of thousands of very 
lowly expressed ncRNAs. We estimate, however, that 
the expression levels of such transcripts would be < 1 tran- 
script per 100 cells (Supplementary Methods). This 
estimate, detailed in the supplement, is based on a refer- 
ence point of 240 000 mRNAs per cell (47). While we note 
that this reference point is imprecise, our overall copy 
number distributions accord well with those obtained 
using reference points based on digital in situ hybridiza- 
tion (48) (data not shown). Even if our copy number esti- 
mates were off by an order of magnitude, these results 
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imply that if the unexplained reads do not represent tech- 
nical noise, they may well represent biological (transcrip- 
tional) noise (28,49). 

Many conserved islands overlap promoter-distal 
regulatory sites 

Having observed that few unannotated conserved islands 
appear to coincide with ncRNAs, we investigated an al- 
ternative hypothesis, that most unannotated conserved 
islands function as promoter-distal RFBSs. Such sites 
include enhancers, insulators and silencers (1,50). Each 
of these types of RFBS is marked by DNAsel Hyper- 
sensitivity Sites (DHSs), since the relatively open chroma- 
tin associated with RFBSs is vulnerable to DNAsel 
digestion (1,51,52). Thus, to identify RFBSs, we 
examined DHSs [Supplementary Table S2a (1)]. We 
found that DHSs overlapped ~9000 unannotated 
conserved islands, suggesting that a large number of 
unannotated conserved islands function as RFBSs. This 
overlap is highly significant (P-value <10 -16 , hypergeo- 
metric test where the intersection is 9000 between 
700000 conserved islands and 60000 DHSs, assuming a 
total of 30 000 000 potential loci), consistent with the idea 
that sequence conservation at overlapping unannotated 
conserved island/DHS loci reflects the importance of regu- 
latory factor binding to DNA. 

In theory, DHSs should correspond to all RFBSs bound 
in a given cell type, but in practice any DHS experiment will 
miss some RFBSs. In an independent approach to finding 
RFBSs, we turned to ChlP-Seq experiments to identify 
binding sites for a number of different regulatory factors. 
The factors immunoprecipitated were NPAS4, CREB, 
SRF and CBP in mouse neurons (18) and AP2a, AP2y, 
MAX, cFOS, cMYC, E2F4 and E2F6 in HeLa cells (1). 
Analysis of the HeLa data revealed that ~90% of the 
binding sites for these factors overlapped a DHS, indicating 
that the DHSs represent a sensitive means of detecting 
RFBSs (Supplementary Table S2b) (53). Unsurprisingly 
in light of this high degree of overlap, ~ 12 000 conserved 
RFBSs identified by factor binding were distributed across 
promoters, enhancers and unclassified RFBSs in much the 
same way as those identified by DNAse hypersensitivity 
(Figure 1; Supplementary Table S2a and S3a). Thus, 
regardless of the method used to identify RFBSs, many 
more unannotated conserved islands overlap with RFBSs 
than with ncRNAs. Moreover, the finding that a much 
larger fraction of the non-coding genomic elements serve 
as binding sites for regulatory factors rather than exons or 
promoters of ncRNAs is true for non-conserved parts of 
the genome as well. For the mouse neurons, we estimate 
that there are ~40000 non-conserved enhancers and 
RFBSs, compared with only 2700 promoters and exons 
for ncRNAs. 

The large number of unannotated conserved islands 
found to overlap RFBSs led us to investigate what regu- 
latory function unannotated conserved RFBSs (ucRFBSs) 
might serve. Insulators, which represent one major class of 
regulatory site, are involved in partitioning active and 
inactive regions of the genome (50,54). We asked how 
many ucRFBSs identified in HeLa cells were insulators, 



defined by the presence of the protein CCCTC-binding 
factor (CTCF). We found that -3 000 ucRFBSs could 
be classified as insulators on this basis (last three rows in 
Supplementary Table S3b). 

Although the remaining ucRFBSs are distal both to 
annotated and HaTriC-defined promoters, we considered 
the possibility that they might be weak, unannotated 
ncRNA promoters not detected by HaTriC. We found 
that <500 ucRFBSs could be classified as promoters 
based on the presence of the promoter-specific H3K4me3 
mark (Figure 2F, Supplementary Table S3c). Nonetheless, 
a larger number of ucRFBSs could represent inactive pro- 
moters that drive high levels of transcription in other cell 
types. We investigated this possibility by examining 
sequence, conservation and expression profiles of 
ucRFBSs (Figure 2). First, we reasoned that if ucRFBSs 
act as promoters in any tissue, they should share sequence 
characteristics with known promoters. Contrary to this 
prediction, ucRFBSs (like enhancers) lack high densities 
of CpG dinucleotides that are frequently found at coding 
promoters (Figure 2H). Second, we reasoned that pro- 
moters and promoter-distal regulatory sequences might 
be distinguished based on their extent of conservation. 
We found that the lengths of conserved islands at 
ucRFBSs more closely resemble those found at enhancers 
than those found at promoters, again suggesting that most 
ucRFBSs do not act as promoters in any tissue or cell-type 
(Figure 2G). Finally, if ucRFBSs act as promoters in any 
tissue, they should be enriched for 5' ends of (5'-sequenced) 
ESTs (37), which have been sequenced at low depth from a 
wide variety of different tissues. However, the overlap 
between ucRFBSs and ESTs is similar to that observed 
between ESTs and previously defined enhancers and is 
dramatically less than the overlap between 5' EST ends 
and promoters (Figure 2B). Moreover, only promoters of 
protein-coding genes have more spliced than unspliced 
ESTs (Supplementary Figure S9). These results suggest 
that ucRFBSs do not act as conventional coding or 
ncRNA promoters. 

We hypothesized that many of the remaining ucRFBSs 
might represent enhancers that were missed by enhancer- 
identification algorithms (5,8,18) because their level of 
enrichment of p300/CBP or H3K4mel was below the 
chosen threshold. To investigate this possibility, we 
examined CBP and H3K4mel binding at ucRFBSs in 
mouse neurons. The majority of ucRFBSs had low but 
significant levels of CBP and H3K4mel enrichment, sug- 
gesting that many of these unclassified sites could indeed 
be enhancers (data not shown and Supplementary 
Figure S5). However, a subset of these ucRFBSs may rep- 
resent regulatory sites that are not enhancers. We 
conclude that ucRFBSs are a mixture of insulators, en- 
hancers and perhaps other classes of promoter-distal regu- 
latory sites (e.g., locus-control regions and silencers). 

Regulatory sites express non-polyadenylated, unspliced 
transcripts at 10 to 100-fold lower levels than mRNAs 

Our case studies of HeLa and mouse neuron cells sug- 
gest that most unannotated conserved islands overlap 
promoter-distal RFBSs (e.g. enhancers), rather than 
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ncRNAs. However, recent genome-wide studies (18,19) 
have found that certain promoter-distal RFBSs (i.e. enhan- 
cers) are associated with low levels of transcription, 
producing ncRNAs. These results blur the distinction 
between promoter-distal regulatory sites and ncRNA pro- 
moters, raising questions about how well these two classes 
of loci can truly be distinguished. However, we find that 
that these enhancers and ucRFBSs can be clearly distin- 
guished experimentally from ncRNA promoters for the 
following reasons. First, while both enhancers and 
ucRFBSs produce RNAs, these RNAs are expressed at 
much lower levels than those produced from traditional 
coding or non-coding promoters (Figure 2A,D, P-value 
<10~ , KS-test on distributions from Figure 2). The 
approximate expression levels of transcripts emanating 
from ucRFBSs averaged <1 transcript per 100 cells 
(Supplementary Figure S4a,b, Supplementary Methods), 
or ~10 to 100-fold lower than the copy number of an 
average genie mRNA. Second, relative to ncRNAs, the 
transcripts emanating from promoter-distal RFBSs are 
less likely to be spliced or polyadenylated (18,19) 
(Figure 2B,C,S9, P-value <10" 16 , KS-test on distributions 
from Figure 2). Third, while the length distributions of 
ncRNAs and ucRFBSs overlap, ucRFBSs are shorter 
(<2kb) than the typical lincRNA (P-value <10" 2 , 
KS-test on distributions from Figure 2). Fourth, 
sequence conservation at ucRFBSs in most cases does 
not extend beyond the specific location where regulatory 
factors bind, whereas coding and ncRNAs promoters 
exhibit conservation over a longer genomic region 
(Figure 2G). Thus, conserved distal regulatory sites 
differ significantly from ncRNA promoters based on 
their chromatin and transcriptional profiles. 

As more cell types are examined, more ucRFBSs 
are found 

To the extent that we can ascribe functions to unan- 
notated conserved islands in our case study of one 
mouse and one human cell type, the ascribed functions 
are overwhelmingly related to binding of regulatory 
factors to DNA, with relatively few unannotated 
conserved islands corresponding to ncRNAs (Figure 1). 
However, our case study explains only ~20000 out of 
700000 unannotated conserved islands, presumably 
because the remainder function only in cell types (or 
cellular conditions) other than those examined here. 
Since we are only able so far to investigate a small 
number of cell-types, our predictions about the functional 
role of the majority of conserved islands requires an 
extrapolation of our findings to additional cell types. In 
this extrapolation, how many additional unannotated con- 
served islands may be attributed to RFBSs and ncRNAs 
as more cell types are examined will depend on the relative 
cell-type specificity of RFBSs and ncRNAs. 

To extrapolate how many additional conserved islands 
could be ascribed to conserved unannotated RFBSs 
(ucRFBSs) and ncRNAs as additional cell types are 
examined, we identified ncRNAs (using H3K4me3- 
binding) and ucRFBSs (using DHS sites) in 10 additional 
human cell lines using data from the ENCODE project (1) 



('Materials and Methods' section). As we examined add- 
itional cell types, we discovered novel ucRFBSs and 
ncRNAs at similar rates (Figure 3A), implying that 
ucRFBSs and ncRNAs have similar cell-type specificity. 
However, this conclusion relies on the accuracy of using 
H3K4me3 to identify ncRNA promoters. To confirm that 
unannotated H3K4me3 loci are a reasonable proxy for 
ncRNA promoters, we used HaTriC to identify 800 novel 
ncRNAs in 10 human tissues, using RNA-Seq data 
generated from total RNA (Supplementary Table S5). 
The approximate number of novel ncRNAs found from 
each additional ENCODE cell line (using H3K4me3) or 
human tissue (using HaTriC) was similar (MOO), suggest- 
ing that unannotated H3K4me3 sites are a reasonable 
proxy for ncRNA promoters [as was previously found by 
others (11)]. In fact relatively weak H3K4me3 sites are fre- 
quently found in locations where no transcription can be 
detected by RNA-Seq (Figure 2, cyan line), suggesting that 
our method of promoter identification by H3K4me3 may 
over-represent the number of ncRNA promoters. Because 
the rate of additional ncRNAs and RFBSs found with each 
new cell type examined is roughly equal, examining add- 
itional cell types does not radically alter our conclusion that 
more unannotated conserved islands have RFBS-related 
than ncRNA-related function. 

We considered finally how many unannotated 
conserved islands might ultimately be assigned to RFBS- 
related or ncRNA-related functions once RFBSs and 
ncRNAs have been identified in all vertebrate cell types. 
The adult human body contains ~400 cell types (55). 
However, this number is likely to be an underestimate in 
the sense that it does not account for rare (and unknown) 
adult cell types, developmental stage-specific cell types, or 
for the ability of cells to adopt different gene expression 
and chromatin states depending on environmental cues. 
The true number is very difficult to approximate, but it 
is nonetheless instructive to extrapolate how conserved 
ncRNA and RFBS discovery might scale with increasing 
numbers of cell types. We therefore arbitrarily took 2 000 
as a potential upper limit on the number of cell types 
across human development and adulthood. Extrapolating 
our findings using the ENCODE cell lines to an estimated 
1000-2000 cell type-conditions, we find it to be plausible 
that 20% of conserved islands encode ncRNAs or their 
promoters, whereas 80% of conserved islands function as 
promoter-distal RFBSs (Figure 3B). Even though there 
are many uncertainties associated with this estimate, it is 
to the best of our knowledge the first quantitative attempt 
to address the question of how many conserved 
non-coding sequences function as distal regulatory 
elements versus encode ncRNAs. We expect that access 
to RNA-Seq and ChlP-Seq data from additional 
cell-types, as well as more accurate multi-species align- 
ments will ultimately allow similar approaches to 
produce more accurate estimates. 

DISCUSSION 

The function of the ~60% of conserved bases in verte- 
brate genomes that are non-coding is unknown (1,3). 
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We provide genome-wide evidence suggesting that the 
reason most of these DNA bases are conserved across 
vertebrate evolution is due to their importance as 
promoter-distal regulatory elements, including enhancers. 
In individual cell types (HeLa cells or mouse neurons), we 
find that 4-fold more conserved islands are associated with 
promoter-distal regulatory elements rather than with 
ncRNAs. As we examine additional cell types, we find in 
each additional cell type four times as many conserved 
islands corresponding to promoter-distal regulatory 
elements as to ncRNAs. Extrapolating these results to 
an estimated 1500 cell type-conditions in a vertebrate 
body, it is plausible that almost ~300000 unannotated 
conserved islands function as promoter-distal regulatory 
elements, while ~60000 encode functions related to 
ncRNAs. At the same time, we cannot rule out the alter- 
native possibility that many or even most unannotated 
conserved islands may have novel, as-yet unknown func- 
tions that are unrelated to RFBSs, ncRNAs, or even gene 
expression. For example, we used the H-rule (56) to 
estimate (Supplementary Figure S10) that 4% of 
conserved islands could correspond to MARs, which 
anchor chromatin to the nuclear matrix (57). 

When characterizing the transcriptome, a crucial 
experimental parameter is the total number of reads 
captured in the experiment, since the extent of sequencing 
determines one's ability to detect lowly expressed tran- 
scripts. To evaluate how the sequencing depth affects 
our results, we down-sampled the RNA-Seq data. 
For mouse neurons >90% of the novel ncRNAs are 
discovered using only 50% of the original reads 
(Supplementary Figure S8), and extrapolation suggests 
that additional novel transcripts would be discovered at 
a rate of ~2 regions/ 10 million reads. Most of the regions 
that would be discovered from additional sequencing 
would express transcripts at even lower abundance than 
the regions that have been discovered at the current 
sequencing depth. 

This low degree of extragenic transcription is in contrast 
with the results reported in a recent study by Mercer et al. 
(58). Using Capture-Seq, a combination of microarray 
capture and RNA-Seq, they identified 257 novel tran- 
scripts from 0.77 Mb of human fibroblast genome. 
Extrapolation of their result suggests that there could be 
~1 million lowly expressed extragenic transcripts in the 
entire genome, whereas extrapolation of our results 
suggests a number that is around three orders of magni- 
tude lower. This difference is likely due in large part to 
their much higher sensitivity, which allows them to detect 
transcripts expressed at an average of 0.0006 mRNAs 
per cell. 

Detection and expression level thresholds are crucial to 
understanding the ongoing debate about the extent of 
intergenic transcription. Initial estimates based on tiling 
arrays (1) suggested that the majority of the genome was 
transcribed in at least one cell type. Even though this 
finding has been challenged (28), other more recent 
studies (58,59) have argued that there is ubiquitous tran- 
scription, albeit at levels as low as an average of 0.0006 
transcripts per cell. At the given sequencing depth, where 
we estimate our detection threshold at about one mRNA 



per cell, our findings are consistent with van Bakel el al. 
(28). Thus, while higher detection sensitivity allows for 
identification of further transcripts, these transcripts are 
expressed at very low levels. The weaker the expression 
level of a protein-coding or ncRNA, the less it tends to 
be conserved (60), suggesting that deeper the transcrip- 
tome is sequenced, the less overlap identified transcripts 
will have with evolutionary conservation. 

Despite most of the genome not being transcribed, 
we do find thousands of short regions that produce low 
levels of transcripts associated with promoters, insula- 
tors, enhancers and other regulatory sites. These tran- 
scripts tend to be unspliced and non-polyadenylated 
(Figure 2B,C), suggesting that they do not leave the 
nucleus. Even if these transcripts themselves are not func- 
tionally important (49), the act of transcribing them may 
nonetheless be necessary. For example, much of the 
observed ncRNA transcription may be important for es- 
tablishing and maintaining chromatin states such as 
histone acetylation or methylation [e.g., (61,62)]. In this 
case, the low levels of associated transcripts may reflect 
either the low stability of the non-functional transcripts 
that are produced or the low frequency of transcriptional 
initiation required for chromatin maintenance. 

We propose that most sequence conservation in the 
non-coding genome reflects the importance of RFBSs. 
This proposition may appear to conflict with the observa- 
tion that binding of certain regulatory factors (CEBPA, 
HNF4A) in liver tissue is poorly (<10% of sites) 
conserved between any two mammals (63). However, it 
has recently been argued that many of non-conserved 
binding sites are less likely to be functional and that the 
degree of conservation is significantly increased if expres- 
sion of nearby genes is taken into consideration (64). 
More broadly, the extent of conservation of binding that 
is observed in any particular experiment likely depends on 
the particular tissues or regulatory factors examined, as is 
suggested by the higher conservation of binding of the 
transcription factor Twist across Drosophila species (65). 
Thus, in order to determine conclusively whether a par- 
ticular conserved genomic sequence is associated with 
conserved binding, it will be necessarily to examine 
multiple bound factors at that locus across multiple 
tissues. Together our findings suggest that while 
many RFBSs come and go over an evolutionary time 
scale, a core set of conserved RFBSs account for most 
of the non-coding sequence conservation in vertebrate 
genomes. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Tables 1-8, Supplementary Figures 1-10, 
Supplementary Methods and Supplementary References 
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